{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "t90y1ms_7_ML"
   },
   "source": [
    "# Gradient Boosting From Scratch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "m7TY2Grc8B9J"
   },
   "source": [
    "Let's implement gradient boosting from scratch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "XiSFTj9lssWl"
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from tensorflow.keras.datasets import boston_housing\n",
    "\n",
    "np.random.seed(0)\n",
    "\n",
    "plt.rcParams['figure.figsize'] = (8.0, 5.0)\n",
    "plt.rcParams['axes.labelsize'] = 20\n",
    "plt.rcParams['ytick.labelsize'] = 14\n",
    "plt.rcParams['xtick.labelsize'] = 14"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 228
    },
    "colab_type": "code",
    "id": "xbsRYs7SDZHr",
    "outputId": "feea96a6-f213-4ad3-aba5-c71b3b2364f4"
   },
   "outputs": [],
   "source": [
    "(x_train, y_train), (x_test, y_test) = boston_housing.load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "PNb6fTlVPPkG"
   },
   "source": [
    "Let's explore the data before building a model. The goal is to predict the median value of owner-occupied homes in $1000s.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "y-D1SsHhPYUq"
   },
   "outputs": [],
   "source": [
    "# Create training/test dataframes for visualization/data exploration.\n",
    "# Description of features: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html\n",
    "feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD','TAX', 'PTRATIO', 'B', 'LSTAT']\n",
    "df_train = pd.DataFrame(x_train, columns=feature_names)\n",
    "df_test = pd.DataFrame(x_test, columns=feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise #1: What are the most predictive features? Compute the correlation of each feature with the label. You may find the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) function useful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zEs8r-YJ2qgc"
   },
   "source": [
    "## Train Gradient Boosting model\n",
    "\n",
    "Training steps to build an ensemble of $K$ estimators:\n",
    "1. At $k=0$, build the base model $\\hat{y}_{0}$: $\\hat{y}_{0}=base\\_predicted$\n",
    "2. Compute the residuals $r_{k,i} = y_{i} - \\hat{y}_{k-1,i}$ for $i = 1, \\dots, n$, where $n$ is the number of training examples\n",
    "3. Train a new model, fitting on the residuals $r_{k}$. We will call the predictions from this model $e_{k}\\_predicted$\n",
    "4. Update the model predictions at step $k$ by adding the predicted residuals to the current predictions: $\\hat{y}_{k} = \\hat{y}_{k-1} + e_{k}\\_predicted$\n",
    "5. Repeat steps 2 - 4 for $k = 1, \\dots, K$.\n",
    "\n",
    "In summary, the goal is to build K estimators that learn to predict the residuals from the prior model; thus we are learning to \"correct\" the\n",
    "predictions up until this point.\n",
    "<br>\n",
    "\n",
    "$\\hat{y}_{K} = base\\_predicted\\ +\\ \\sum_{j=1}^Ke_{j}\\_predicted$"
   ]
  },
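  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small worked illustration of one boosting step, consider a made-up example (the numbers and the split are assumptions for this sketch, not taken from the housing data). Suppose there are three training labels $y = (10, 20, 30)$. The base model predicts their mean, so $\\hat{y}_{0} = (20, 20, 20)$, and the residuals are\n",
    "\n",
    "$$r_{1} = y - \\hat{y}_{0} = (-10,\\ 0,\\ 10)$$\n",
    "\n",
    "Assume the weak learner is a depth-1 tree that separates the first example from the other two and predicts the mean residual on each side, giving $e_{1}\\_predicted = (-10, 5, 5)$. The update then yields\n",
    "\n",
    "$$\\hat{y}_{1} = \\hat{y}_{0} + e_{1}\\_predicted = (10,\\ 25,\\ 25), \\qquad r_{2} = y - \\hat{y}_{1} = (0,\\ -5,\\ 5)$$\n",
    "\n",
    "so the training RMSE drops from $\\sqrt{200/3} \\approx 8.16$ to $\\sqrt{50/3} \\approx 4.08$, and the next weak learner would be fit on $r_{2}$."
   ]
  },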
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZOa5-a_8L_oA"
   },
   "source": [
    "### Build base model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise #2: Make an initial prediction using the `BaseModel` class -- configure the `predict` method to predict the training mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "iE7357JRL-Xv"
   },
   "outputs": [],
   "source": [
    "class BaseModel(object):\n",
    "    \"\"\"Initial model that predicts mean of train set.\"\"\"\n",
    "\n",
    "    def __init__(self, y_train):\n",
    "        self.train_mean = # TODO\n",
    "\n",
    "    def predict(self, x):\n",
    "        \"\"\"Return train mean for every prediction.\"\"\"\n",
    "        return # TODO\n",
    "\n",
    "def compute_residuals(label, pred):\n",
    "    \"\"\"Compute difference of labels and predictions.\n",
    "\n",
    "    When using mean squared error loss function, the residual indicates the \n",
    "    negative gradient of the loss function in prediction space. Thus by fitting\n",
    "    the residuals, we performing gradient descent in prediction space. See for\n",
    "    more detail:\n",
    "\n",
    "    https://explained.ai/gradient-boosting/L2-loss.html\n",
    "    \"\"\"\n",
    "    return label - pred\n",
    "\n",
    "def compute_rmse(x):\n",
    "    return np.sqrt(np.mean(np.square(x)))"
   ]
  },
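  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The claim in the `compute_residuals` docstring, that the residual is the negative gradient, can be checked in one line. As a sketch, assume the per-example loss is written with a factor of $\\frac{1}{2}$ (a common convention, not something this notebook fixes):\n",
    "\n",
    "$$L(y_{i}, \\hat{y}_{i}) = \\frac{1}{2}\\,(y_{i} - \\hat{y}_{i})^{2} \\quad\\Rightarrow\\quad -\\frac{\\partial L}{\\partial \\hat{y}_{i}} = y_{i} - \\hat{y}_{i}$$\n",
    "\n",
    "Without the factor of $\\frac{1}{2}$, the negative gradient is $2\\,(y_{i} - \\hat{y}_{i})$, which points in the same direction. Fitting the residuals is therefore still a gradient-descent step in prediction space, up to a constant that a learning rate can absorb."
   ]
  },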
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "colab_type": "code",
    "id": "kzdhD6Su2LbF",
    "outputId": "64ed7c2f-946b-4ffe-fe34-5887edf2ad0f"
   },
   "outputs": [],
   "source": [
    "# Build a base model that predicts the mean of the training set.\n",
    "base_model = BaseModel(y_train)\n",
    "test_pred = base_model.predict(x_test)\n",
    "test_residuals = compute_residuals(y_test, test_pred)\n",
    "compute_rmse(test_residuals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "28YAYLk4e1IU"
   },
   "source": [
    "Let's see how the base model performs on our test data by plotting its predictions against the `LSTAT` feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 351
    },
    "colab_type": "code",
    "id": "a-4Ob3GcfFC8",
    "outputId": "932f10f0-ad9f-466e-c099-1e5a2f868fb1"
   },
   "outputs": [],
   "source": [
    "feature = df_test.LSTAT\n",
    "\n",
    "# Pick a predictive feature for plotting.\n",
    "plt.plot(feature, y_test, 'go', alpha=0.7, markersize=10)\n",
    "plt.plot(feature, test_pred, label='initial prediction')\n",
    "\n",
    "plt.xlabel('LSTAT', size=20)\n",
    "plt.legend(prop={'size': 20});"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XnxjXH4WfFSI"
   },
   "source": [
    "There is definitely room for improvement. We can also look at the residuals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 351
    },
    "colab_type": "code",
    "id": "2n2D88oXfx13",
    "outputId": "1027304d-04d6-4364-d2e8-dde706e53720"
   },
   "outputs": [],
   "source": [
    "plt.plot(feature, test_residuals, 'bo', alpha=0.7, markersize=10)\n",
    "plt.ylabel('residuals', size=20)\n",
    "plt.xlabel('LSTAT', size=20)\n",
    "plt.plot([feature.min(), feature.max()], [0, 0], 'b--', label='0 error');\n",
    "plt.legend(prop={'size': 20});"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AFlqdEN7Uap8"
   },
   "source": [
    "### Train Boosting model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TrqvuB54Uub7"
   },
   "source": [
    "Returning to boosting, let's use our base model as our initial prediction. We'll then perform subsequent boosting iterations to improve upon this model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9KHXsCZMgaEo"
   },
   "source": [
    "Next, define a helper, `create_weak_learner`, that initializes a decision tree to use as the weak learner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qJDbirEsaYCN"
   },
   "outputs": [],
   "source": [
    "def create_weak_learner(**tree_params):\n",
    "    \"\"\"Initialize a Decision Tree model.\"\"\"\n",
    "    model = DecisionTreeRegressor(**tree_params)\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "R8kCPP6Wfamg"
   },
   "source": [
    "Make initial prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise #3: Update the prediction on the training set (`train_pred`) and on the testing set (`test_pred`) using the weak learner that predicts the residuals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_model = BaseModel(y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training parameters.\n",
    "tree_params = {\n",
    "    'max_depth': 1,\n",
    "    'criterion': 'mse',  # Named 'squared_error' in scikit-learn >= 1.0.\n",
    "    'random_state': 123\n",
    "}\n",
    "N_ESTIMATORS = 50\n",
    "BOOSTING_LR = 0.1\n",
    "\n",
    "# Initial prediction, residuals.\n",
    "train_pred = base_model.predict(x_train)\n",
    "test_pred = base_model.predict(x_test)\n",
    "train_residuals = compute_residuals(y_train, train_pred)\n",
    "test_residuals = compute_residuals(y_test, test_pred)\n",
    "\n",
    "# Boosting.\n",
    "train_rmse, test_rmse = [], []\n",
    "for _ in range(0, N_ESTIMATORS):\n",
    "    train_rmse.append(compute_rmse(train_residuals))\n",
    "    test_rmse.append(compute_rmse(test_residuals))\n",
    "    # Train weak learner.\n",
    "    model = create_weak_learner(**tree_params)\n",
    "    model.fit(x_train, train_residuals)\n",
    "    # Boosting magic happens here: add the residual prediction to correct\n",
    "    # the prior model.\n",
    "    grad_approx = # TODO\n",
    "    train_pred += # TODO\n",
    "    train_residuals = compute_residuals(y_train, train_pred)\n",
    "\n",
    "    # Keep track of residuals on the test set.\n",
    "    grad_approx = # TODO\n",
    "    test_pred += # TODO\n",
    "    test_residuals = compute_residuals(y_test, test_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XUhgTI_BH7Qk"
   },
   "source": [
    "## Interpret results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "nhCF--TZg2BJ"
   },
   "source": [
    "Can you improve the model results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure()\n",
    "plt.plot(train_rmse, label='train error')\n",
    "plt.plot(test_rmse, label='test error')\n",
    "plt.ylabel('rmse', size=20)\n",
    "plt.xlabel('Boosting Iterations', size=20);\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "rHwfvd5JKA2m"
   },
   "source": [
    "## Let's visualize how the performance changes across iterations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 643
    },
    "colab_type": "code",
    "id": "tAM1X70onv6w",
    "outputId": "d54c68e6-2702-4b8f-80a7-3e832aee2823"
   },
   "outputs": [],
   "source": [
    "# Pick a predictive feature for plotting.\n",
    "feature = df_test.LSTAT\n",
    "ix = np.argsort(feature)\n",
    "\n",
    "plt.plot(feature, y_test, 'go', alpha=0.7, markersize=10)\n",
    "plt.plot(feature[ix], test_pred[ix], label='boosted prediction', linewidth=2)\n",
    "\n",
    "plt.xlabel('LSTAT', size=20)\n",
    "plt.legend(prop={'size': 20});"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "ASL_a_boosting_from_scratch.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
